Abstract

Background: Human cancer is caused by the accumulation of somatic mutations in tumor suppressors and 
oncogenes within the genome. In the case of oncogenes, recent theory suggests that there are only a few key "driver" 
mutations responsible for tumorigenesis. As there have been significant pharmacological successes in developing 
drugs that treat cancers that carry these driver mutations, several methods that rely on mutational clustering have 
been developed to identify them. However, these methods consider proteins as a single strand without taking their 
spatial structures into account. We propose an extension to current methodology that incorporates protein tertiary 
structure in order to increase our power when identifying mutation clustering. 

Results: We have developed iPAC (identification of Protein Amino acid Clustering), an algorithm that identifies 
non-random somatic mutations in proteins while taking into account the three dimensional protein structure. By 
using the tertiary information, we are able to detect both novel clusters in proteins that are known to exhibit mutation 
clustering as well as identify clusters in proteins without evidence of clustering based on existing methods. For 
example, by combining the data in the Protein Data Bank (PDB) and the Catalogue of Somatic Mutations in Cancer, 
our algorithm identifies new mutational clusters in well-known cancer proteins such as KRAS and PIK3CA. Further, by
utilizing the tertiary structure, our algorithm also identifies clusters in EGFR, EIF2AK2, and other proteins that are not
identified by current methodology. The R package is available at:
http://www.bioconductor.org/packages/2.12/bioc/html/iPAC.html.

Conclusion: Our algorithm extends the current methodology to identify oncogenic activating driver mutations by 
utilizing tertiary protein structure when identifying nonrandom somatic residue mutation clusters. 



Background 

Cancer is one of the most widespread and heteroge- 
neous diseases imposing a huge toll on patients, rela- 
tives, friends, and society. However, at its most basic, it 
is a genetic disease that is caused by the accumulation 
of somatic mutations in oncogenes and tumor suppressors
[1]. While mutations in tumor suppressors tend to
down-regulate the activity of genes that prevent cancer,
mutations in proto-oncogenes either up-regulate
or deregulate the activities of the resulting proteins. So
far, pharmacological intervention has been shown to be more
successful at inhibiting activated oncogenes than restoring
tumor suppressor gene function. Coupled with the
idea of "oncogene addiction", that many cancers rely on
mutations in a small subset of key genes to be able to 
continue their uncontrolled growth while the remainder 
of the mutations constitute passenger mutations [2,3], the 
problem of identifying activating oncogenic mutations has 
received great attention in cancer research. 

Recently, several studies have shown support for the 
hypothesis that activating somatic mutations tend to clus- 
ter in protein kinases [2,4,5]. Further, as observed by 
[6], mutational clusters might provide further information 
regarding where to look for activating mutations, reducing 
the driver mutation search space needed to be analyzed. 
Moreover, mutational clusters that lead to either bene- 
ficial or detrimental phenotypic changes may point to 
regions that are under positive or directional selection as 
well as regions that are functionally significant and thus 
can be targeted by protein engineering [7]. 

So far, several methods based upon the number of mutations
in a specific region have been developed to detect
potential driver oncogenic mutations as well as naturally
selected regions. One common method hypothesizes
that driver mutations have a higher non-synonymous
mutation rate as compared to the background mutation
rate [5,8]. Further, one can look at the ratio of
non-synonymous ($K_a$) to synonymous ($K_s$) changes per site
[9]. A criterion for selection is then to check if $K_a/K_s > 1$,
based on the hypothesis that the benchmark neutral rate
of nucleotide substitution is exceeded when positive selection
also contributes to the substitution process. Similarly,
[10] proposes a hypothesis that driver mutations have a
larger mutational rate than the background mutational
rate after gene length normalization.

While the approaches mentioned above have had some 
success in detecting positive selection and/or identifying 
driver mutations, they nevertheless have several shortcomings.
First, many of them are dependent on calculating
the disparity in non-synonymous versus synonymous
mutations but do not recognize that selection often occurs 
on very small sections of the gene and thus might fail when 
averaged over the entirety of the gene length. Second, the 
methods described above [9,10] do not make any attempt 
to distinguish between activating and non-activating non- 
synonymous mutations. 

In addition to the approaches described above, some 
researchers have focused on creating classifiers in order 
to determine mutation status. As described in [11], these 
algorithms employ a variety of machine learning tech- 
niques, such as Random Forests [12] and Support Vector 
Machines [13], to calculate a score for each mutation. 
These scores are typically calculated using a variety of 
information such as measures of evolutionary conserva- 
tion as well as physico-chemical properties such as size 
and polarity of substituted and original residues as well as 
surface accessibility. These scores are then used to clas- 
sify the mutation. For example, PolyPhen-2 [14] predicts 
whether a missense mutation is damaging while CHASM 
[15] attempts to discriminate between driver and passen- 
ger mutations. While several of these models have had 
significant success in classifying the mutation, they all 
require large and well annotated data sets in order to first 
train the machine learning classifier and then apply the 
resulting rule set. 

Recently, [6] developed Non-Random Mutational Clus- 
tering (NMC) to identify potential activating mutations 
by hypothesizing that, in the absence of heretofore known 
mutational hotspots, a mutational cluster is indicative of 
selection for an activating driver mutation since only a 
small number of precise mutations can activate a pro- 
tein [4,5]. By looking at the order statistics and assuming 
that the locations of amino acid mutations follow a uni- 
form distribution when the protein is considered in linear 
form under the null hypothesis, they identify clusters 
by calculating whether any two pairwise mutations are
closer together on the line than expected by chance alone.
Despite its success, one limitation of the NMC method 
is that the proteins are treated as a linear sequence with- 
out considering the three dimensional structures of the 
proteins. 

In this work, we extend the NMC methodology to 
account for tertiary protein structure. This enables the 
identification of mutational clusters that are relatively 
far away in linear space but relatively close together in 
3D space. We proceed to show that our methodology is 
effective in identifying novel mutational clusters that are 
missed by NMC in key cancer proteins such as KRAS
and PIK3CA. Unlike NMC, iPAC is also able to identify
the EGFR and EIF2AK2 proteins as containing mutational
clustering. We also show that many of the clusters
identified by iPAC are predicted to be deleterious by well-known
machine learning algorithms such as PolyPhen-2
[14]. However, iPAC has the distinct advantage of requir- 
ing only the mutational positions and tertiary structure 
which allows its application to novel mutations and struc- 
tures for which extensive information and literature is not 
yet available. Finally, we also show that for a large per- 
centage of protein structures, the tertiary structure leads 
to a net reduction in mutational clusters found, thus pre- 
senting a simplified clustering mutational landscape. Ulti- 
mately, by providing a refined picture of the mutational 
clustering, we are able to provide a more accurate representation
of where potential activating mutations may
reside within the protein. 

Methods 

Our method, named iPAC, uses a four-step approach to finding
mutational clusters. First, mutational and positional
data are obtained from the COSMIC [16] and PDB [17] 
databases (described in Sections "Obtaining mutational 
data" and "Obtaining the 3D structural data", respectively). 
The mutational and positional information is then recon- 
ciled to allow a single numerical reference to identify the 
same physical amino acid in both databases (Section "Rec- 
onciling the structural and mutational data"). Next, Mul- 
tiDimensional Scaling (MDS) [18] is used to map the 
protein structure from 3D to 1D space while preserving, as
best as possible, all pairwise three dimensional distances 
between amino acids for a given protein (Section "Mul- 
tidimensional scaling"). The NMC algorithm is then run 
on the remapped amino acids to find mutational clusters 
(Section "NMC"). Finally, the clusters are mapped back 
into the original protein space and reported back to the 
user. In the following subsections we discuss each of these 
steps in detail. 

Obtaining mutational data 

Mutational data were obtained from the COSMIC 
database (version 58) via ftp.sanger.ac.uk/pub/CGP/cosmic 
and implemented using Oracle. In order to justify the
assumption that amino acids follow a uniform distribu- 
tion of mutation, only mutations that were found through 
whole gene screens were included. Further, we only used 
missense mutations that belonged to two categories: 1) 
"Confirmed somatic variant" or 2) "Reported in another 
cancer sample as somatic". All nonsense and synony- 
mous mutations as well as mutations that had dif- 
ferent somatic status categories were excluded. Fur- 
ther, as multiple studies can report mutational data 
from the same cell line, mutational redundancies were 
removed to avoid double counting. See "Additional file 1: 
Cosmic Query" for the SQL code and schema used to 
generate the data. Finally, in order to match mutational 
data with structural data, only the proteins for which a 
UniProt Accession Number [19] was available were kept. 
This resulted in 777 unique proteins. 

Obtaining the 3D structural data 

The protein structural data were obtained from the PDB 
database via http://www.pdb.org. As one protein can have 
several structures, for each of the 777 proteins described 
above, all the structures with a matching UniProt Acces- 
sion Number were obtained. If a specific structure had 
more than one polypeptide chain with a matching amino 
acid sequence in UniProt, the first matching chain listed 
was used (typically chain A). For proteins where the resolution
was sufficiently high to provide more than
one alternative conformation for a specific amino acid side
chain, only the first conformation listed in the file was
used. Once the appropriate side chain and conformation
were selected, the (x,y,z) coordinates of all the α-carbon
atoms were extracted and used to represent the 3D back- 
bone structure of the protein. In all, this process resulted 
in 1,904 structures. See "Additional file 2: Structure Files" 
for a full listing of the structures and side chains used for 
each protein considered. 
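
The extraction step described above can be sketched as follows. This is a minimal illustration rather than the code used by the iPAC package: it assumes the legacy fixed-column PDB text format, and the function name `alpha_carbons`, the helper `_atom_line`, and the sample records are hypothetical. It keeps the first conformation listed for each residue of the chosen chain, mirroring the rule stated in the text.

```python
def alpha_carbons(pdb_text, chain_id="A"):
    """Return {residue_number: (x, y, z)} for the alpha-carbons of one chain.

    Assumes fixed-column ATOM records (legacy PDB format). When a residue
    has several alternative conformations, only the first one listed in
    the file is kept.
    """
    coords = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()   # columns 13-16
        chain = line[21]                  # column 22
        if atom_name != "CA" or chain != chain_id:
            continue
        res_seq = int(line[22:26])        # columns 23-26
        if res_seq in coords:             # a later alternative conformation
            continue
        coords[res_seq] = (
            float(line[30:38]),           # x, columns 31-38
            float(line[38:46]),           # y, columns 39-46
            float(line[46:54]),           # z, columns 47-54
        )
    return coords


def _atom_line(serial, name, alt, res, chain, seq, x, y, z):
    """Build one ATOM record in fixed-column layout (illustrative data only)."""
    return (f"ATOM  {serial:5d} {name:<4s}{alt:1s}{res:>3s} {chain:1s}{seq:4d}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


# Hypothetical records: residue 1 has two conformations (A and B).
sample = "\n".join([
    _atom_line(1, " N  ", " ", "MET", "A", 1, 0.0, 0.0, 0.0),
    _atom_line(2, " CA ", "A", "MET", "A", 1, 1.0, 2.0, 3.0),
    _atom_line(3, " CA ", "B", "MET", "A", 1, 1.5, 2.5, 3.5),
    _atom_line(4, " CA ", " ", "THR", "A", 2, 4.0, 5.0, 6.0),
    _atom_line(5, " CA ", " ", "MET", "B", 1, 9.0, 9.0, 9.0),
])
ca_coords = alpha_carbons(sample, chain_id="A")
```

Residues without coordinates simply do not appear in the returned dictionary, which matches the later rule that mutational data for residues lacking positional data are not used.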

Reconciling the structural and mutational data 

Due to a different numbering system of the amino acids 
employed by the PDB and COSMIC databases, an align- 
ment needed to be performed in order to reference the 
same residue numerically in both databases. Two meth- 
ods in the iPAC package were designed to reconcile these 
differences, one based on pairwise alignment [20] and the 
other based on a numerical reconstruction from the struc- 
tural data obtained from the PDB. As there are often sig- 
nificant technical difficulties for such a reconstruction, for 
the rest of this paper, unless specifically noted, pairwise 
alignment was used to reconcile these elements. Please 
see the documentation in the iPAC package for a full 
description of these two methods. Successful alignment of 
mutational and positional data occurred on 140 proteins 
which corresponded to 1100 unique structure/side-chain
combinations and 667 unique residue positions containing
1,434 total mutations. We note that for any given
structure/side-chain combination, if there is no positional 
data for a specific residue, the mutational data for that 
residue is not used. Please see "Additional file 2: Structure 
Files" for a full description. 

Multidimensional scaling 

As the underlying clustering algorithm is dependent upon 
the construction of order statistics, we used MDS [18] to 
remap the amino acids into one dimensional space while 
preserving (as best as possible) the pairwise distances 
between them in 3D space. Specifically, given an $n \times n$
dissimilarity matrix,

$$
\Delta =
\begin{pmatrix}
\delta_{1,1} & \delta_{1,2} & \cdots & \delta_{1,n} \\
\delta_{2,1} & \delta_{2,2} & \cdots & \delta_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
\delta_{n,1} & \delta_{n,2} & \cdots & \delta_{n,n}
\end{pmatrix},
$$

the MDS algorithm maps each $\delta_{i,j}$ into a corresponding
distance $d_{i,j}(X)$ on a new m-dimensional metric space
$X$. Formally, for a specific representation function,
$f : \delta_{i,j} \to d_{i,j}(X)$, we have that the original dissimilarities
are preserved in $X$, specifically, $f(\delta_{i,j}) = d_{i,j}(X)$. Here,
$f$ can be either fully defined or chosen from a specified
class of functions and is employed to handle the case
when the proximity measures come from a space that is
not necessarily a true metric space. Further, as it is not
always possible to preserve the exact distance (for example,
due to sampling effects, measurement precision or
loss of dimensionality), rather than insist on
$f(\delta_{i,j}) = d_{i,j}(X)$, the MDS framework is typically set up such that
$f(\delta_{i,j}) \approx d_{i,j}(X)$. Thus, by minimizing a badness-of-fit
measure called raw stress, $\sigma_r = \sum_{i<j} [f(\delta_{i,j}) - d_{i,j}(X)]^2$,
we identify the $x_1, \ldots, x_n$ that preserve our distances in
the new metric space $X$. However, raw stress by itself is
not always informative as it is subject to distortion by the
choice of units used. For instance, if the scale used to measure
changes by a factor of 100, the raw stress will change
as well but by a factor of $100^2$. Thus, Stress-1, which is
defined as:

$$
\sigma_1 = \sqrt{\frac{\sum_{i<j} [f(\delta_{i,j}) - d_{i,j}(X)]^2}{\sum_{i<j} d_{i,j}^2(X)}} \qquad (1)
$$

and is not subject to unit distortion, will be minimized
instead.
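
To make the unit-invariance argument concrete, the following sketch computes Stress-1 as the square root of the summed squared residuals divided by the summed squared configuration distances, taking the representation function to be the identity. The function name `stress1` and the three-point example are illustrative assumptions, not part of the iPAC package.

```python
import math


def stress1(delta, coords):
    """Stress-1 of a configuration, with the representation function f
    taken to be the identity.

    delta  : symmetric matrix (list of lists) of dissimilarities
    coords : list of coordinate tuples in the target space
    """
    num = 0.0
    den = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            num += (delta[i][j] - d) ** 2
            den += d ** 2
    return math.sqrt(num / den)


# Hypothetical example: a 3-4-5 triangle forced onto a line.
delta = [[0.0, 3.0, 4.0],
         [3.0, 0.0, 5.0],
         [4.0, 5.0, 0.0]]
coords = [(0.0,), (3.0,), (4.0,)]

base = stress1(delta, coords)

# Rescale every length by 100: raw stress would grow by 100**2,
# while Stress-1 is unchanged.
delta_scaled = [[100.0 * v for v in row] for row in delta]
coords_scaled = [(100.0 * c[0],) for c in coords]
scaled = stress1(delta_scaled, coords_scaled)
```

In this example only the pair with dissimilarity 5 is misrepresented (it is mapped to a distance of 1), and rescaling the units leaves the value unchanged.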

For the purposes of this paper, the dissimilarity matrix is
simply equal to the pairwise distance between any two amino
acids in the protein. Specifically, the distance between
residues $i$ and $j$, denoted $\delta_{i,j}$, is taken to be the Euclidean
distance between their respective α-carbon atoms. As
Euclidean space is a proper metric space, from now on
we assume that $f$ is the identity function. Further, as we
require units along the line in order to calculate order
statistics, the MDS algorithm will be applied such that we
find $x_1, \ldots, x_n \in \mathbb{R}$. Thus, the MDS algorithm finds scalars
$x_1, \ldots, x_n$ such that $|x_i - x_j| \approx \delta_{i,j}$ for any two
amino acids $i$ and $j$ in the protein. We present an example
when MDS is applied to the 3GFT structure of KRAS [21]
in Figures 1 and 2 below.
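
A remapping of this kind can be sketched with classical (Torgerson) MDS. Note that this is a substitute technique chosen for brevity: it uses an eigendecomposition of the double-centered squared-distance matrix rather than the Stress-1 minimization described above, so its output will generally differ from that of the iPAC package. The function name `classical_mds_1d`, the power-iteration solver, and the starting vector are assumptions of this sketch.

```python
import math


def classical_mds_1d(points, iterations=500):
    """Map points (tuples of coordinates) to scalars with classical MDS.

    The leading eigenvector of the double-centered matrix is found by
    power iteration. This assumes the starting vector is not orthogonal
    to that eigenvector, which holds for the example below.
    """
    n = len(points)
    # Squared Euclidean distances between all pairs of points.
    d2 = [[math.dist(p, q) ** 2 for q in points] for p in points]
    row_mean = [sum(row) / n for row in d2]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (D^2 is symmetric, so
    # column means equal row means).
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:          # all points coincide
            return [0.0] * n
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the leading eigenvalue.
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in v]


# Hypothetical alpha-carbon positions lying on a straight line in 3D,
# so that a one dimensional embedding can preserve every distance.
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0), (3.0, 6.0, 6.0), (7.0, 14.0, 14.0)]
xs = classical_mds_1d(points)
```

For real protein structures the points are not collinear, so the pairwise distances are only approximated and some distortion is unavoidable. The sign of the embedding is arbitrary, which does not matter because only differences between the scalars are used.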

NMC 

We employed the NMC algorithm [6] to find the mutational
clusters in one dimensional space. Specifically, consider
a protein with $N$ amino acids in which each amino
acid has a uniform probability of $1/N$ of mutation. Given
$m$ samples and $n$ mutations, we are able to calculate the
order statistics for every mutation (see Figure 3). Two
mutations $X_{(i)}$ and $X_{(k)}$ are then defined to be clustered if
$\Pr(C_{ki} = X_{(k)} - X_{(i)}) < \alpha$. This probability is then calculated
for every pair of mutations and adjusted for multiple
comparisons using either the Benjamini-Hochberg (BH)
adjustment [22] or the Bonferroni adjustment [23,24]. For
the analyses performed in this paper, the more conservative
Bonferroni adjustment was used. Finally, it is important
to note that the structural information obtained for
each protein often does not include positional information 
on every amino acid within the protein. We removed these 
"missing" amino acids from the protein before running the 
NMC clustering algorithm so that we can compare iPAC 
and NMC on an equal basis. 

[6] derive closed form solutions to calculate
$\Pr(C_{ki} = c)$ for $c \in \{0, 1, \ldots, N - 1\}$. However, as this
becomes computationally inefficient, they suggest dividing
$C_{ki}$ by $N$ and assuming a continuous uniform distribution
on (0, 1). They then show that in the limit, the CDF
becomes as follows:
Figure 1 KRAS α-carbons in 3D Space.
Figure 2 KRAS α-carbons mapped to the x-axis using MDS.
Thus, via Equation (2), we can directly calculate if
two mutations are closer together than by chance alone
quickly and efficiently. For a given structure, a cluster was
considered to be significant using an α-level of 0.05 and
the Bonferroni adjustment. Specifically, the p-value of the
cluster must be < $\alpha / (n(n+1)/2)$, where $n(n+1)/2$ are all the
pairwise mutations considered.
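
As an illustration of the continuous approximation, the sketch below relies on a standard result for order statistics: for $n$ independent Uniform(0, 1) draws, the spacing $X_{(k)} - X_{(i)}$ follows a Beta$(k-i,\ n-k+i+1)$ distribution, whose CDF for integer parameters equals a binomial tail sum. This is offered as a sketch under that assumption and may differ in detail from the expression used by NMC (for example, in how the discrete distance is rescaled by $N$). The function name `spacing_cdf` is hypothetical.

```python
from math import comb


def spacing_cdf(c, i, k, n):
    """P(X_(k) - X_(i) <= c) for n i.i.d. Uniform(0, 1) draws, 1 <= i < k <= n.

    Uses the identity that a Beta(r, n - r + 1) CDF with integer r equals
    the probability that at least r of n uniform draws fall below c.
    """
    if not (1 <= i < k <= n):
        raise ValueError("require 1 <= i < k <= n")
    c = min(max(c, 0.0), 1.0)
    r = k - i
    return sum(comb(n, j) * c ** j * (1.0 - c) ** (n - j)
               for j in range(r, n + 1))


# With two draws, the gap between them has CDF 2c - c^2.
p_two = spacing_cdf(0.5, 1, 2, 2)

# Hypothetical case: 60 of 76 mutations within 1% of the protein length.
p_tight = spacing_cdf(0.01, 1, 60, 76)
```

A small value means that many mutations fall within a span that would rarely contain them under the uniform null hypothesis, which is the sense in which two mutations are "closer together than by chance alone".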
Figure 3 An example of constructing the order statistics.

Suppose we had 3 samples of a protein that is N amino acids long. If
amino acid i has a "*" above it, that indicates that the amino acid for
that sample had a non-synonymous missense mutation. The samples
are then collapsed together and the number of mutations for each
residue is shown above the box on the right. These counts form the
order statistics. The first mutation is on residue 2 ($X_{(1)} = 2$), the next 3
mutations are on residue 3 ($X_{(2)} = X_{(3)} = X_{(4)} = 3$), the next
mutation is on residue 5 ($X_{(5)} = 5$) and the last 2 mutations are on
residue 6 ($X_{(6)} = X_{(7)} = 6$).
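
The collapsing step in the figure can be sketched as follows. The per-sample assignment of mutations below is an assumption made for illustration (the figure itself is not reproduced here); only the per-residue totals are taken from the caption. The function name `order_statistics` is hypothetical.

```python
from collections import Counter


def order_statistics(samples):
    """Collapse per-sample mutated residue positions into order statistics.

    samples : iterable of iterables, each holding the residue positions
              that carry a missense mutation in one sample
    Returns (counts, ordered), where counts maps residue -> number of
    mutations and ordered is the sorted list X_(1) <= ... <= X_(n).
    """
    counts = Counter(pos for sample in samples for pos in sample)
    ordered = sorted(counts.elements())
    return counts, ordered


# Three hypothetical samples whose totals match the caption:
# residue 2 once, residue 3 three times, residue 5 once, residue 6 twice.
samples = [
    [2, 3, 6],
    [3, 5],
    [3, 6],
]
counts, ordered = order_statistics(samples)
```

Here `ordered[0]` corresponds to $X_{(1)}$ and `ordered[-1]` to $X_{(7)}$; the pairwise differences of this list are the quantities tested for clustering.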
Multiple comparison adjustment for structures

In addition to the Bonferroni multiple comparison adjust- 
ment done by the NMC method, an adjustment is also 
required to account for testing multiple structures per 
protein. Since the structures for a given protein could be 
quite similar and thus lead to similar clustering results, 
a second Bonferroni adjustment would be too conserva- 
tive. Instead, a combined Bonferroni-FDR approach was 
performed as follows. First, for a given protein, the NMC
reported p-value for a given cluster was multiplied by
$n(n-1)/2$ to calculate P*. Thus, on a per-protein level, P*
represents the inverse Bonferroni adjustment performed
by the NMC algorithm and thus allowed us to compare
each cluster's P* to an α-level of 0.05 to determine significance.
To account for all the structures analyzed, we
computed a rough FDR (rFDR) [25] which approximates
the standard FDR method for a large number of positively
correlated or independent tests. Under this approach, we
estimate the expected value of α over all k tests and then
use that as the significance threshold. The expected value
of α can be approximated by:

$$
\text{rFDR} = \alpha \cdot \frac{k+1}{2k},
$$

where k is the total number of structures. In the case
of the 1100 structures analyzed in this study, rFDR ≈
0.02502. Finally, any cluster for which P* < 0.02502
was deemed to be significant. For the rest of this
paper, with the exception of Table 1, we only report 
the p-value to avoid confusion. Nevertheless, each clus- 
ter presented in Section "Results and discussion" is in 
fact significant after adjusting for structural multiple 
comparisons. 
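
The arithmetic of this two-stage adjustment can be checked with a short sketch. The function names and the example cluster (a reported p-value of 1e-6 with 10 mutations) are hypothetical; the threshold for k = 1100 reproduces the value quoted above.

```python
def rough_fdr(alpha, k):
    """Rough FDR threshold alpha * (k + 1) / (2k) for k tests."""
    return alpha * (k + 1) / (2 * k)


def p_star(p_value, n_mutations):
    """Multiply a reported p-value by the number of mutation pairs,
    n(n - 1)/2, as described in the text."""
    pairs = n_mutations * (n_mutations - 1) // 2
    return p_value * pairs


threshold = rough_fdr(0.05, 1100)     # about 0.02502

# Hypothetical cluster: reported p-value 1e-6 in a structure with 10 mutations.
example = p_star(1e-6, 10)            # 45 pairs, so about 4.5e-05
is_significant = example < threshold
```

As k grows, the threshold approaches α/2, so the adjustment for structures is far milder than a second Bonferroni correction, which would divide α by k.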

Results and discussion 

Using the iPAC package, 215 of the total 1100 struc- 
tures analyzed were found to have significant clustering. 
When comparing iPAC with the original NMC method,
out of the 140 proteins analyzed, both iPAC and NMC
identified 8 proteins that contained significant clusters.
However, iPAC also identified 3 new proteins,
specifically EGFR, EIF2AK2 and HAO1. These 3 new
proteins correspond to 10 of the 215 structures found 
to have clustering. iPAC also found structure 2ENQ for 
the protein PIK3CA to contain a significant cluster while 
NMC did not. The 8 proteins identified by both algo- 
rithms correspond to the remaining 204 structures. There 
were no proteins that were identified by NMC but were 
subsequently missed by the iPAC algorithm. Please see 
"Additional file 3: Results Summary" for a full listing of 
which structures and which proteins were found to be 
significant. 

As can be seen from Figure 4, approximately 70% of 
all the structures found to have significant clustering 
differed in the number of clusters identified when comparing
iPAC vs NMC. This suggests that in
some cases, consideration of the tertiary structure identifies
additional clusters while in other cases, clusters can
be removed, offering a simplified view of the
mutational information. While it is outside the scope of 
this paper to consider every one of the 215 structures 
with clustering, we present three representative cases 
where integration of the tertiary protein structure into 
the analysis had a significant effect: 1) identification of 
mutation clustering in a protein that would otherwise 
be missed, 2) identification of new mutation clusters in 
a protein that was detected using the NMC methodol- 
ogy, and 3) reduction of the total mutational clusters in 
a protein that was detected using the NMC methodol- 
ogy. We also note, as can be seen in Table 1, that the 
p-value found for the most significant cluster is similar on 
the protein level. Proteins that had very significant clus- 
tering, such as KRAS and TP53, remain very significant 
when the tertiary structure is incorporated. Proteins that 
were less significant, such as IDE and AKT1, remain so 
as well. 



Table 1 A comparison of the most significant iPAC and NMC p-values from the 8 proteins that were picked up by both
algorithms

            iPAC                        NMC
Protein     P-value      P*             P-value      P*
KRAS        6.17E-185    6.35E-181      4.39E-233    4.52E-229
TP53        5.23E-128    6.11E-123      4.37E-86     5.30E-81
BRAF        3.73E-130    1.01E-126      3.84E-130    1.04E-126
PIK3CA      8.20E-84     3.58E-80       8.20E-84     3.58E-80
NRAS        5.38E-26     6.46E-24       8.26E-29     9.91E-27
HRAS        1.23E-10     5.54E-09       5.61E-10     8.42E-09
AKT1        1.18E-05     7.08E-05       2.47E-05     7.41E-05
IDE         2.20E-05     6.60E-05       1.56E-03     4.67E-03

P* is calculated as described in Section "Multiple comparison adjustment for structures".
Figure 4 A comparison of NMC and iPAC over all the structures
that were found to be significant. The number of structures in each 
category is shown along with the percentage. 



We note that 9 out of the 11 proteins that were found 
significant by iPAC had their most significant cluster over- 
lap a binding site, proton acceptor site or kinase domain. 
Of the remaining 2 proteins, the most significant clus- 
ter for PIK3CA overlapped amino acid 1047, which has 
been shown to ease the entrance of substrates and hence 
potentially increase the substrate turnover rate, a typi- 
cal oncogenic behavior [26]. For a detailed per protein 
description, please see "Additional file 4: Relevant Sites". 

Finally, we validated the performance of iPAC using two 
popular machine learning algorithms, PolyPhen-2 [14] 
and CHASM [15]. First, this validation must be consid- 
ered in light of the fact that these algorithms require a 
much more extensive set of information than iPAC. Nev- 
ertheless, over 98% of the amino acids that occurred in 
significant mutation clusters were also identified as significant
(with an FDR of < 20%) by PolyPhen-2 and CHASM.
For full details, please see "Additional file 5: Performance 
Validation". 

iPAC finds novel proteins 

As discussed in Section "Results and discussion", three new
proteins were identified by iPAC that were missed when
tertiary structures were not accounted for. The EGFR protein,
a cell-surface receptor for epidermal growth factor
family ligands [27], is perhaps the most well known and 
has been found in a wide array of cancers such as lung [28], 
anal [29] and glioblastoma multiforme [30]. Although 
seven EGFR structures were identified by iPAC to contain 
significant clustering, we will concentrate on the 2GS7 
structure [31] as it showed the most significant clustering.
As seen in Table 2, three significant clusters were
found with cluster 3 being a sub-cluster of cluster 1.
Figure 5 shows the orientation of these clusters in three
dimensional space.

Table 2 The three most significant clusters found in EGFR
for the 2GS7 structure

Cluster   Start   End   Muts. in cluster   P-value
1         751     858   4                  1.35E-04
2         719     751   2                  2.41E-03
3         790     858   2                  2.82E-03

Overall, all the statistically significant clusters found
deal with lung cancer pathology and an increase in kinase
activity. The two mutations in cluster 2, G719S and T751I,
are both found in lung cancer, with the first mutation
responsible for strongly increased kinase activity [32-34]
and the second found in erlotinib-responsive non-small
cell lung cancer (NSCLC) patients [35,36].
Cluster 3 contains two mutations, T790M and L858R,
both of which have been found in lung cancer and are
known for increased kinase activity as well [32-34,37].
Finally, cluster 1 is comprised of clusters 2 and 3, with an
additional mutation S768I which potentially shows a positive
clinical response to gefitinib in NSCLC patients [38].
It is interesting to note that both clusters 1 and 2, which
are identified via statistical analysis, contain mutations
that have been found to benefit from pharmacological
intervention. Had the tertiary structure of EGFR not been
taken into account, these clusters would not have been
identified by the NMC algorithm. When the protein is
viewed linearly, the mutations occur too far away from
each other to result in statistically significant p-values.

Figure 5 The EGFR structure (PDB ID 2GS7) color coded
by region: 1) cluster 1 - light blue and yellow, 2) cluster 2 - blue
and 3) cluster 3 - yellow. The boundary α-carbon amino acids of
719, 751, 768, 790 and 858 are shown as purple spheres (see Table 2
for details of each cluster).

iPAC finds additional clusters

One example where iPAC finds additional clusters is in
the KRAS protein when analyzing the 3GFT structure
[21]. KRAS, part of the RAS set of proteins which are
involved in a large number of signaling cascades, is one of
the most studied oncogenes, with activating mutations
in approximately 17-25% of all human cancers [39].
While both NMC and iPAC identified many of the same
clusters such as amino acids 12-13, 12-61 and 12-146,
iPAC identified several novel clusters as well, specifically
amino acids 61-117 and 117-146. We note that both algorithms
specifically identify a cluster between residues 12
and 146, and given that we only have positional data for
167 residues, this signifies that there is one large cluster
that covers ≈ 80% of all the available amino acids.
However, combined with the two novel clusters identified
by iPAC, we are able to partition the protein into three
distinct regions 1) 12-61, 2) 61-117 and 3) 117-146 that
cover 30%, 34% and 18% of the protein respectively (see
Figure 6).

We also ran NMC and iPAC on each region separately 
to consider how the clustering results would be affected. 
As can be seen from Table 3, failure to account for the ter- 
tiary protein structure resulted in region 3 no longer being 
detected and region 1 losing significance by over ninety 
orders of magnitude. 

Further, while somatic mutations in region 12-61 have 
been found in many cancers such as colorectal, lung, pan- 
creatic and bladder [8,33,40-43], somatic mutations at 
amino acids 61, 117 and 146 have primarily been found 
in lung and colorectal carcinomas. Even more specifically, 
mutations at amino acids 117 and 146 (K → N and A
→ T, respectively) deal mostly with colorectal cancer [8].




Figure 6 The 3GFT structure color coded by region: amino acids
13-60 are light blue, 62-116 are red and 118-145 are yellow. The
boundary α-carbon amino acids of 12, 61, 117 and 146 are shown as
purple spheres (see Table 3 for details of each cluster).



Table 3 P-value for each region when the region is
considered independently

                  P-value
Region            NMC         iPAC
1) 12-61          1.37E-11    3.36E-105
2) 61-117         -           -
3) 117-146        -           3.35E-12
2&3) 61-146       -           3.31E-05

A "-" signifies that the region was not found to be significant.



Thus, by taking into account the tertiary structure, the 
clusters identified by iPAC subdivide the protein along 
pathological lines. 

iPAC finds fewer clusters than NMC 

Of the 215 structures found to contain significant clustering,
86 structures were identified where iPAC found
fewer clusters than NMC. Three of these structures
correspond to BRAF, 31 correspond to HRAS and 52 cor- 
respond to TP53. Here, we consider structure 3TV4 [44] 
for the BRAF protein as it contains the most significant 
cluster found by both iPAC and NMC. For this protein, 
it is well known that amino acid 600 is one of the most 
highly mutated residues. In our dataset, 60 of the 76 total 
mutations that fulfilled the requirements described in 
Section "Obtaining mutational data" occurred on amino 
acid 600. As expected, the most significant "cluster" is 
located solely on that amino acid, with an iPAC p-value 
of $3.73 \times 10^{-130}$ and an NMC p-value of $3.84 \times 10^{-130}$.
However, in total, iPAC identifies 9 clusters for this struc- 
ture while NMC identifies 19, with the differences shown 
in Tables 4 and 5. 

While it is outside the scope of this paper to con- 
sider all the differences between Table 4 and 5, we would 



Table 4 The significant clusters found by both and NMC 

Clusters found by both NMC and iPAC 



Start 


End 


# Muts. 


P-value 
iPAC 


NMC 


600 


600 


60 


3.73 E-130 


3.84 E-130 


469 


600 


70 


9.76 E-122 


5.63 E-16 


600 


601 


62 


3.10E-79 


1.35 E-117 


597 


600 


62 


4.05 E-77 


2.20 E-105 


464 


600 


71 


1.25 E-73 


1.74 E-16 


596 


600 


64 


3.06 E-73 


8.28 E-103 


581 


600 


66 


1.99 E-51 


2.96 E-64 


600 


671 


63 


7.78 E-15 


3.54 E-28 


469 


469 


4 


7.50 E-04 


7.50 E-04 
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Table 5 The clusters that were not deemed significant by 
iPAC but were deemed significant by NMC 

Clusters dropped by iPAC 



Start 


End 


# Muts. 


NMC Pvalue 


Dy / 


OU I 


O^f 


O.Zo t - 1 Uj 


596 


601 


66 


9.97 E-102 


581 


601 


68 


8.73 E-67 


596 


671 


67 


1.10 E-31 


597 


671 


65 


1 .93 E-29 


581 


671 


69 


2.22 E-28 


464 


601 


73 


7.09 E-19 


469 


601 


72 


3.58 E-1 8 


464 


671 


74 


6.01 E-09 


469 


671 


73 


2.38 E-08 



like to point out that, contrary to iPAC, the NMC algo- 
rithm reports the two longest clusters: 1) 464-671 (p- 
value = 6.01 x 10^-9) and 2) 469-671 (p-value = 2.38 x 
10^-8). After alignment of the structure as described in 
Section "Obtaining the 3D structural data" we only have 
structural information on amino acids 448-723. Thus, the 
largest cluster detected by NMC covers approximately 75% of all the 
amino acids that we are considering. However, by taking 
into account the 3D structure of the protein, these ultra- 
long clusters are dropped and the clusters where iPAC and 
NMC overlap show 2 distinct areas of the protein, amino 
acids 464-600 and 600-671. As expected, as the majority 
of mutations occur on amino acid 600, both NMC and 
iPAC declare that the "cluster" located at amino acid 600 
is highly significant. 
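
The coverage figure quoted above can be checked with a few lines of 
arithmetic. The following is only an illustrative sketch: the residue 
ranges are taken from the text, while the function and variable names 
are our own and are not part of the iPAC package. 

```python
# Residue ranges quoted in the text (inclusive on both ends).
STRUCTURE_RANGE = (448, 723)      # residues with 3D coordinates after alignment
LONGEST_NMC_CLUSTER = (464, 671)  # longest cluster reported by NMC


def span_length(start: int, end: int) -> int:
    """Number of residues in an inclusive range."""
    return end - start + 1


def coverage(cluster: tuple, structure: tuple) -> float:
    """Fraction of the structure's residues spanned by the cluster."""
    return span_length(*cluster) / span_length(*structure)


fraction = coverage(LONGEST_NMC_CLUSTER, STRUCTURE_RANGE)
print(f"{fraction:.1%}")  # 208 of 276 residues
```
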

Further, as described below, when only the clusters found 
by taking the 3D structure into account are considered 
(see Figure 7), the results again tend to fall along 
pathological function. After applying the methodology 
described in Section "Obtaining mutational data", the 
mutations that were found to be in significant clusters 
included G464V, G466V, G469V, G469A, N581S, G596R, 
L597V, L597R, V600E, V600K, K601N and R671Q. As 
R671Q was found in only one sample within the COSMIC 
database and does not have extensive literature, we will 
not include it in further discussion. Taking into account 
the 3 most significant clusters picked up by iPAC and 
NMC, we now consider the protein in 3 parts: A) Residues 
469-599, B) Residue 600 and C) Residue 601 (we have 
slightly adjusted the clusters displayed in Table 4 to avoid 
overlap). The listed mutations that fall within region A 
correspond primarily to lung and colorectal cancer [2,45-49]. 
Region B, which comprises only amino acid 600, contains 
by far the most common mutation in BRAF. This mutation 
results in constitutive and elevated kinase activity 
and has been found in a wide range of cancers including 
colorectal carcinoma, ovarian serous carcinoma, metastatic 
melanoma and pilocytic astrocytoma. Further, 
supporting the hypothesis that somatic clusters might 
provide pharmacological targets, it has already been 
shown that suppression of this cluster in melanoma 
causes tumor growth arrest and helps promote apoptosis 
[2,8,48,50-52]. Finally, the K601N mutation in region 
C has been found in multiple myeloma patients who also 
may benefit from BRAF inhibitors [53]. 
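
The three-part partition described above can be expressed as a small 
lookup. This is a minimal sketch under our own assumptions: the region 
boundaries are those stated in the text, the mutation strings are a 
subset of those listed, and the helper names (position, region_of) are 
hypothetical and not part of iPAC. 

```python
import re

# Region boundaries as stated in the text (inclusive residue ranges).
REGIONS = {"A": (469, 599), "B": (600, 600), "C": (601, 601)}


def position(mutation: str) -> int:
    """Extract the residue number from a label such as 'V600E'."""
    match = re.fullmatch(r"[A-Z](\d+)[A-Z]", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation label: {mutation!r}")
    return int(match.group(1))


def region_of(mutation: str):
    """Return 'A', 'B' or 'C', or None if the residue lies outside all three."""
    pos = position(mutation)
    for name, (start, end) in REGIONS.items():
        if start <= pos <= end:
            return name
    return None


for label in ["G469A", "N581S", "V600E", "K601N"]:
    print(label, region_of(label))
```

Residues outside the three stated ranges map to None rather than 
being forced into a region. 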

Conclusion 

In this paper, we extended the existing methodology avail- 
able to find somatic mutation clustering by utilizing the 
information provided in the protein tertiary structure. In 
doing so, we showed that we are able to find both new pro- 
teins with clustering as well as new clusters in previously 
found proteins. We have also shown that by taking into 
account 3D structure, we are able to remove clusters that 
do not have biological meaning. The method is fast and 
robust, with the vast majority of proteins analyzed within 
5-10 minutes when executed on a desktop with 8 GB of 
DDR3 RAM and an Intel i7 3600k processor running at 
a frequency of 3.40 GHz. Further, as the underlying 
calculation relies upon the NMC algorithm, a preset fixed 
window size is not required, which allows for the 
detection of clusters of various lengths [6]. We have also shown 
that by employing a completely statistical methodology, 
we are able to identify mutations that, when suppressed 
via pharmacological intervention, may stop further tumor 
growth. 

Figure 7 The 3TV4 structure color coded by region: 1) amino acids 
464-600 are light blue; 2) amino acids 601-671 are orange. The 
α-carbons of the mutated amino acids 464, 466, 469, 581, 596, 597, 
601 and 671 are shown as purple spheres. Amino acid 600 is colored 
red (see Table 4 for details on each cluster). 

This methodology, while an improvement on the NMC 
method, still suffers from some limitations. First, the 
mutation status of all the amino acids must be determined, 
although with the advent of high-throughput sequencing, 
this will become less of an issue as time progresses. Also, 
both hypermutability of genomic locations and unequal 
rates of mutagenesis might violate the assumption that 
each amino acid has a uniform mutation probability. For 
instance, it is well known that hypermutable positions for 
both somatic and germline mutations exist. Insertions and 
deletions that are typically sequence dependent have been 
removed from the analysis and only missense substitu- 
tions of single amino acids have been kept in this study 
to help reduce such uniformity violations. Similarly, CpG 
dinucleotides can have mutational frequency that is ten 
times or more that of other dinucleotides [54]. However, 
less than 13% of the mutations used to find clustering in 
Sections "iPAC finds novel proteins", "iPAC finds addi- 
tional clusters", and "iPAC finds fewer clusters than NMC" 
were in CpG sites. Further, as described by [6], tobacco 
smoking preferentially causes transversions in lung can- 
cer while the mutational landscape for colorectal cancer 
has more transitions [55]. Nevertheless, in the context of 
KRAS, the vast majority of mutations occur on amino 
acids 12, 13 and 61 for both lung and colorectal cancer. 
This suggests that while the mutational spectrum may be 
different, it does not have a large effect on the position of 
mutations and thus the uniformity assumption. As with 
previous studies, while this analysis is influenced by non- 
random factors, it nonetheless appears that selection of a 
cancer phenotype is the primary cause of clustering. 

It should also be noted that while iPAC is designed to 
take tertiary structure into account, it is only able to do 
so by appealing to the MDS methodology. Future research 
is required in order to relax this restriction to potentially 
identify additional clustering results. Next, as we obtained 
our mutational data from COSMIC, some tissue types 
are over- or under-represented. However, such situations 
would make our analysis more conservative and the clus- 
ters we find even more significant. If different tissue types 
host mutations in different parts of the protein, aggre- 
gating over all tissue types will result in a larger value 
of n while the value of k and i for two specific muta- 
tions (as seen in Equation 2) would remain the same. This 
results in a higher p-value, implying that clusters that are 
found to be significant after collapsing over tissue type 
would be even more so if only a specific tissue type was 
analyzed. 
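
The direction of this effect can be illustrated numerically. Equation 2 
is not reproduced in this section, so the sketch below is a stand-in 
rather than the authors' exact formula: it assumes the standard 
continuous-uniform approximation, in which the gap between the k-th and 
i-th ordered mutation positions among n mutations follows a 
Beta(i - k, n - (i - k) + 1) distribution. The function name and the 
parameter r (cluster length as a fraction of protein length) are our 
own illustrative choices. 

```python
from math import comb


def cluster_p_value(n: int, k: int, i: int, r: float) -> float:
    """P(U_(i) - U_(k) <= r) for n uniform points on [0, 1], with k < i.

    The gap is Beta(i - k, n - (i - k) + 1) distributed; for integer
    parameters its CDF equals the binomial tail P(Bin(n, r) >= i - k).
    """
    gap = i - k
    return sum(comb(n, j) * r**j * (1 - r) ** (n - j) for j in range(gap, n + 1))


# Same cluster (same k, i and relative length r), but n grows because
# mutations from other tissue types, located elsewhere, are pooled in.
single_tissue = cluster_p_value(n=50, k=1, i=6, r=0.01)
pooled_tissues = cluster_p_value(n=100, k=1, i=6, r=0.01)
print(single_tissue, pooled_tissues)
```

Under these assumptions the pooled p-value is the larger of the two, 
which is consistent with the argument that aggregating over tissue 
types makes the analysis more conservative. 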

Finally, as shown in Section "Results and discussion", 
iPAC finds fewer clusters for a significant percentage of 
the structures analyzed. This reduction in total clusters 
can come from two sources: the removal of some amino 
acids due to a lack of tertiary position information, or a 
cluster no longer being found significant when 3D 
structure is taken into account. The first source, while 
already rare, will become even more so in the future as 
more detailed structural information becomes available. 



As for the second source, when a cluster is not identi- 
fied under iPAC when compared to NMC, an overlapping 
or nearby cluster is typically found (as shown in Tables 4 
and 5). For BRAF specifically, there was a total of 3 struc- 
tures where iPAC found fewer clusters than NMC. Fur- 
ther, every "possibly" or "probably damaging" mutation, 
as categorized by PolyPhen-2 [14], was still represented 
in at least one cluster in each structure. Thus, in the case 
of BRAF, none of the damaging mutations identified by 
PolyPhen-2 were lost. For a more detailed analysis, please 
see "Additional file 6: Potential Driver Loss". Ultimately, 
additional research is required to further reduce the 
possibility of losing driver mutations while taking into account 
tertiary structure. 
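
A simplified version of this coverage check can be sketched for the 
3TV4 BRAF structure. Note the assumptions: the cluster boundaries are 
those of Table 4 and the mutated positions are those listed in the 
Figure 7 caption; the check is purely positional and does not use the 
PolyPhen-2 categories, and the function name is hypothetical. 

```python
# iPAC clusters for 3TV4 as (start, end) residue pairs, from Table 4.
IPAC_CLUSTERS = [
    (600, 600), (469, 600), (600, 601), (597, 600), (464, 600),
    (596, 600), (581, 600), (600, 671), (469, 469),
]

# Mutated residue positions named in the Figure 7 caption.
MUTATED_POSITIONS = [464, 466, 469, 581, 596, 597, 600, 601, 671]


def uncovered(positions, clusters):
    """Return the positions that fall inside none of the clusters."""
    return [
        pos for pos in positions
        if not any(start <= pos <= end for start, end in clusters)
    ]


missing = uncovered(MUTATED_POSITIONS, IPAC_CLUSTERS)
print(missing)  # an empty list means every position is still represented
```
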

In conclusion, we present an approach that extends 
current methodology to identify mutation clustering by 
taking into account protein tertiary structure. We fur- 
ther show that by taking into account tertiary structure 
we are able to detect clusters that would otherwise be 
missed. Next, we demonstrate that for some of the clus- 
ters found, pharmacological intervention has already been 
successfully applied, further confirming the hypothesis 
that mutational clustering might point to activating driver 
mutations. As additional protein structures continue to be 
solved, iPAC would be able to rapidly perform a statisti- 
cal analysis to identify such potential mutations. Finally, as 
we gain a better understanding of the tertiary structure of 
DNA, this method might also have applications to finding 
mutational clustering on the DNA level. 

Endnotes 

a For this analysis, we included mutational and 
positional data only on residues 1-167. No 3D positional 
information was available in the 3GFT structure on 
residues 168-188, and these residues were removed 
before the analysis. Further, the structural information 
has amino acid 61 as a histidine (isoform 2B for KRAS in 
the Uniprot Database) while the COSMIC database has a 
glutamine in that position. However, as the substitution 
of one amino acid in the structure for another would not 
have a significant effect on its spatial orientation and as 
amino acid 61 has a large number of somatic mutations, 
it was kept in the analysis. 



Additional files 



Additional file 1: Cosmic Query. The SQL query used to extract the 
mutations from COSMIC. 

Additional file 2: Structure Files. A detailed list of which 
protein-structure combinations were used and what side-chains were 
selected. 

Additional file 3: Results Summary. A summary of each structure's most 
significant p-value for both iPAC and NMC. 

Additional file 4: Relevant Sites. A review showing which of the iPAC 
clusters fall within structurally relevant sites. 

Additional file 5: Performance Validation. In-depth results validating 
the iPAC results using PolyPhen-2 and CHASM. 

Additional file 6: Potential Driver Loss. An analysis of whether any 
potential driver mutations are lost when iPAC finds fewer clusters than NMC. 